Focused Web Corpus Crawling
Abstract
In web corpus construction, crawling is a necessary step, and it is probably the most costly of all: it requires expensive bandwidth, and excess crawling increases storage requirements. Excess crawling results from the fact that the web contains a lot of redundant content (duplicates and near-duplicates), as well as other material not suitable or desirable for inclusion in web corpora or web indexes (for example, pages with little or virtually no text). An optimized crawler for web corpus construction would ideally avoid crawling such content in the first place, saving bandwidth, storage, and post-processing costs. In this paper, we show in three experiments that two simple scores are suitable for improving the ratio between corpus size and crawling effort in web corpus construction. The first score relates to the overall text quality of the page containing the link; the second relates to the likelihood that the local block enclosing a link is boilerplate.

1 Crawl Optimization and Yield Ratios

Optimizing a crawling strategy consists in maximizing its weighted coverage WC(t) at any time t during a crawl (Olston and Najork, 2010, 29), i.e., the summed weight of the documents downloaded until t, where the weight of each crawled document is calculated as a measure of the usefulness of the document relative to the purpose of the crawl. To maximize WC, it is vital to estimate the weight of the documents behind harvested links before download, such that documents with potentially lower weight have a lower probability of being downloaded. So-called focused crawlers (in a broad sense) are designed to maximize WC with respect to some specific definition of document weight, for example when documents with high search-engine relevance (measured as PageRank or a similar score), documents about specific subjects, or documents in a specific language are desired (Chakrabarti et al., 1999; Menczer et al., 2004; Baykan et al., 2008; Safran et al., 2012). For our purpose, i.e., web corpus crawling, a document with a high weight can simply be defined as one which is not removed from the corpus by the post-processing tools due to low linguistic quality and/or one which contributes a large amount of text to the corpus. Recently, an interesting approach to crawl optimization along such lines was suggested which relies on statistics about the corpus yield from known hosts (Suchomel and Pomikálek, 2012). Under this approach, the weight (of a whole web host rather than a single document) is taken to be the ratio of good documents from the host remaining in the corpus after a specific post-processing chain has been applied to the documents. Harvested URLs pointing to such hosts are prioritized accordingly. We follow a similar route to Suchomel and Pomikálek, but look at document-local features instead of host statistics. Throughout this paper, we refer to the yield ratio rather than WC, although they are related notions. We define the yield ratio Y_d for a set D_c of crawled unprocessed documents and a set D_r of retained documents after filtering and processing for inclusion in a corpus, with D_r ⊂ D_c, as:

Y_d = |D_r| / |D_c|
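As a rough illustration of the idea, the sketch below shows how a yield-ratio helper and the two document-local scores could steer a crawl frontier. It is a minimal sketch, not the authors' implementation: the scoring inputs (page_quality, block_boilerplate) stand in for whatever classifiers a real post-processing chain provides, and the equal weighting of the two scores is an assumption made purely for illustration.

```python
# Minimal sketch: prioritize harvested URLs by document-local link scores
# and track the yield ratio Y_d = |D_r| / |D_c| of the crawl.
from __future__ import annotations

import heapq
from dataclasses import dataclass, field


def yield_ratio(num_crawled: int, num_retained: int) -> float:
    """Share of crawled documents retained in the corpus after post-processing."""
    return num_retained / num_crawled if num_crawled else 0.0


@dataclass(order=True)
class QueuedUrl:
    priority: float                  # lower value = downloaded earlier (min-heap)
    url: str = field(compare=False)


class ScoredFrontier:
    """Priority queue over harvested URLs, scored before download."""

    def __init__(self) -> None:
        self._heap: list[QueuedUrl] = []

    def add_link(self, url: str, page_quality: float, block_boilerplate: float) -> None:
        # Links found on high-quality pages and outside boilerplate blocks
        # get a higher combined score; the product is an illustrative choice.
        score = page_quality * (1.0 - block_boilerplate)
        # Negate the score so that heapq's min-heap pops the best link first.
        heapq.heappush(self._heap, QueuedUrl(priority=-score, url=url))

    def next_url(self) -> str | None:
        return heapq.heappop(self._heap).url if self._heap else None


if __name__ == "__main__":
    frontier = ScoredFrontier()
    frontier.add_link("http://example.org/article", page_quality=0.9, block_boilerplate=0.1)
    frontier.add_link("http://example.org/tag-cloud", page_quality=0.3, block_boilerplate=0.8)
    print(frontier.next_url())       # the article link is crawled first
    print(yield_ratio(1000, 210))    # e.g. 0.21 of downloads survive filtering
```

In this sketch the frontier only reorders downloads; the yield ratio is computed afterwards from crawl and post-processing counts, which is how it serves as the evaluation measure in the experiments described here.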
Similar Resources
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download domain-specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed; among them a key technique is focused crawling, which is able to crawl particular topical...
Language Specific and Topic Focused Web Crawling
We describe an experiment on collecting large language- and topic-specific corpora automatically by using a focused Web crawler. Our crawler combines efficient crawling techniques with a common text classification tool. Given a sample corpus of medical documents, we automatically extract query phrases and then acquire seed URLs with a standard search engine. Starting from these seed URLs, the cr...
Comparing the Quality of Focused Crawlers and of the Translation Resources Obtained from them
Comparable corpora have been used as an alternative to parallel corpora as resources for computational tasks that involve domain-specific natural language processing. One way to gather documents related to a specific topic of interest is to traverse a portion of the web graph in a targeted way, using focused crawling algorithms. In this paper, we compare several focused crawling algorithms usin...
Focused Crawling Techniques
The need for increasingly specific replies to web search queries has prompted researchers to work on focused web crawling techniques for web spiders. A variety of lexical and link-based approaches to focused web crawling are introduced in the paper, highlighting important aspects of each. General Terms: Focused Web Crawling, Algorithms, Crawling Techniques.
Lexical Profiling of Existing Web Directories to Support Fine-grained Topic-Focused Web Crawling
Topic-focused Web crawling aims to harness the potential of the Internet reliably and efficiently, producing topic-specific indexes of pages within the Web. Previous work has focused on supplying suitably general descriptions of topics to generate large general indexes. In this paper we propose a method that uses lexical profiling of a corpus that consists of hierarchical structures in existing...